Visualization of Public Trees in Vancouver

The Vancouver trees dataset contains a listing of public trees on boulevards in the City of Vancouver and provides data on tree coordinates, species and other related characteristics.

For more information, see: https://opendata.vancouver.ca/explore/dataset/street-trees/information/?disjunctive.species_name&disjunctive.common_name&disjunctive.height_range_id&disjunctive.on_street&disjunctive.neighbourhood_name.

In this example, I investigate the top 10 tree species present in the dataset, looking at their prevalence within the city (which neighbourhoods they can be found in) and at how the number of trees of each species planted each year has changed over time. In addition, I look at how tree properties (diameter and height) vary between species and neighbourhoods.

I use a combination of the following plots:

  • heat map

  • bar chart

  • line chart

  • geographic map

  • scatter plot

Description and Review of Data

# Import libraries needed for this assignment
import altair as alt
import pandas as pd
# Read in the file. Let's immediately parse the "date_planted" column into DateTime dtype.
trees_df = pd.read_csv('https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv', parse_dates=["date_planted"])
trees_df.head(10)
Unnamed: 0 std_street on_street species_name neighbourhood_name date_planted diameter street_side_name genus_name assigned ... plant_area curb tree_id common_name height_range_id on_street_block cultivar_name root_barrier latitude longitude
0 10747 W 20TH AV W 20TH AV PLATANOIDES Riley Park 2000-02-23 28.5 EVEN ACER N ... 15 Y 21421 NORWAY MAPLE 4 0 NaN N 49.252711 -123.106323
1 12573 W 18TH AV W 18TH AV CALLERYANA Arbutus-Ridge 1992-02-04 6.0 ODD PYRUS N ... 7 Y 129645 CHANTICLEER PEAR 2 2300 CHANTICLEER N 49.256350 -123.158709
2 29676 ROSS ST ROSS ST NIGRA Sunset NaT 12.0 ODD PINUS N ... 7 Y 154675 AUSTRIAN PINE 4 7800 NaN N 49.213486 -123.083254
3 8856 DOMAN ST DOMAN ST AMERICANA Killarney 1999-11-12 11.0 EVEN FRAXINUS N ... 7 Y 180803 AUTUMN APPLAUSE ASH 4 6900 AUTUMN APPLAUSE N 49.220839 -123.036721
4 21098 EAST BOULEVARD EAST BOULEVARD HIPPOCASTANUM Shaughnessy NaT 15.5 ODD AESCULUS Y ... N Y 74364 COMMON HORSECHESTNUT 4 5200 NaN N 49.238514 -123.154958
5 17458 BUTE ST BUTE ST PERSICA West End 2012-04-05 3.0 EVEN PARROTIA N ... C Y 233622 VANESSA PERSIAN IRONWOOD 1 1100 VANESSA N 49.281906 -123.133076
6 1476 PRESTWICK DRIVE NASSAU DRIVE CAMPESTRE Victoria-Fraserview NaT 12.0 ODD ACER N ... 15 Y 105171 HEDGE MAPLE 3 1700 NaN N 49.217522 -123.071311
7 5120 FLEMING ST FLEMING ST OFFICINALIS Kensington-Cedar Cottage 2001-04-02 3.0 EVEN MAGNOLIA N ... N Y 187792 CHINESE MAGNOLIA 2 3700 NaN N 49.251127 -123.071912
8 18338 W PENDER ST W PENDER ST PALUSTRIS Downtown 1999-12-17 8.0 ODD QUERCUS N ... C Y 104016 PIN OAK 1 100 NaN N 49.281303 -123.108253
9 28279 MATAPAN CRESCENT MATAPAN CRESCENT ZUMI Renfrew-Collingwood 2008-03-13 3.0 ODD MALUS N ... 12 Y 102612 REDBUD CRABAPPLE 1 3200 CALOCARPA Y 49.257272 -123.030023

10 rows × 21 columns

trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Unnamed: 0          5000 non-null   int64         
 1   std_street          5000 non-null   object        
 2   on_street           5000 non-null   object        
 3   species_name        5000 non-null   object        
 4   neighbourhood_name  5000 non-null   object        
 5   date_planted        2363 non-null   datetime64[ns]
 6   diameter            5000 non-null   float64       
 7   street_side_name    5000 non-null   object        
 8   genus_name          5000 non-null   object        
 9   assigned            5000 non-null   object        
 10  civic_number        5000 non-null   int64         
 11  plant_area          4950 non-null   object        
 12  curb                5000 non-null   object        
 13  tree_id             5000 non-null   int64         
 14  common_name         5000 non-null   object        
 15  height_range_id     5000 non-null   int64         
 16  on_street_block     5000 non-null   int64         
 17  cultivar_name       2658 non-null   object        
 18  root_barrier        5000 non-null   object        
 19  latitude            5000 non-null   float64       
 20  longitude           5000 non-null   float64       
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB

There are 5000 entries in the data frame, with columns of type int64, object and float64 (plus date_planted, which I parsed as datetime64). The columns “date_planted”, “plant_area”, and “cultivar_name” contain null or NaN values. “date_planted” and “cultivar_name” in particular are missing roughly half of their values (2363 and 2658 non-null out of 5000, respectively), so it may be better to drop these columns; that, of course, depends on the questions of interest and what we want to explore in the analysis. Given that I want to investigate how the number of trees of each species planted each year has changed over time, I will NOT drop the date_planted column.
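
A quick way to quantify missing values before deciding whether to drop a column is to count nulls per column. The sketch below uses a small made-up frame (not the real dataset); only the column names match, and the values are purely illustrative.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, only the column names match the dataset.
sample_df = pd.DataFrame({
    "species_name": ["SERRULATA", "RUBRUM", "NIGRA", "ZUMI"],
    "date_planted": pd.to_datetime(["2000-02-23", None, None, "2008-03-13"]),
    "cultivar_name": ["KWANZAN", None, None, None],
})

# Count of missing values per column, and the same expressed as a fraction of rows.
missing_counts = sample_df.isna().sum()
missing_fraction = sample_df.isna().mean()

print(missing_counts)
print(missing_fraction.round(2))
```

Applied to the full dataset, the info() output above implies 2637 missing date_planted values (5000 − 2363) and 2342 missing cultivar_name values (5000 − 2658).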

# Let's see some summary statistics
trees_df.describe()
         Unnamed: 0                   date_planted     diameter  civic_number        tree_id  height_range_id  on_street_block     latitude    longitude
count   5000.000000                           2363  5000.000000   5000.000000    5000.000000       5000.00000      5000.000000  5000.000000  5000.000000
mean   14861.920400  2003-09-06 04:03:08.912399488    12.340888   2975.707600  128682.584600          2.73440      2960.227000    49.247349  -123.107128
min        2.000000            1989-10-31 00:00:00     0.000000      2.000000      36.000000          0.00000         0.000000    49.202783  -123.220560
25%     7192.750000            1997-11-06 00:00:00     4.000000   1300.500000   61321.500000          2.00000      1300.000000    49.230152  -123.144178
50%    14870.000000            2003-02-12 00:00:00    10.000000   2639.000000  130130.500000          2.00000      2600.000000    49.247981  -123.105861
75%    22366.750000            2009-11-17 00:00:00    18.000000   4123.000000  191332.000000          4.00000      4100.000000    49.263275  -123.063484
max    29992.000000            2019-05-07 00:00:00    71.000000   9113.000000  270750.000000          9.00000      9100.000000    49.293930  -123.023311
std     8680.023278                            NaN     9.266600   2078.580429   75412.260406          1.56957      2086.861052     0.021251     0.049137
# Finally, let's use value_counts() to see how many different "species_names" and "common_names" there are, and just to see what types of strings these columns contain.
top_trees_species_names = trees_df["species_name"].value_counts()
top_trees_species_names
species_name
SERRULATA      463
PLATANOIDES    444
CERASIFERA     396
RUBRUM         261
AMERICANA      182
              ... 
GRANDIFLORA      1
LAEVIS           1
LOEBNERI  X      1
SERRULA          1
LUTEA            1
Name: count, Length: 171, dtype: int64
top_trees_common_names = trees_df["common_name"].value_counts()
top_trees_common_names
common_name
KWANZAN FLOWERING CHERRY       383
PISSARD PLUM                   295
NORWAY MAPLE                   215
CRIMEAN LINDEN                 152
PYRAMIDAL EUROPEAN HORNBEAM    100
                              ... 
CHINESE WINGNUT                  1
ELM SPECIES                      1
UMBRELLA CATALPA                 1
MAGNOLIA 'MERRILL'               1
SWEETGUM SPECIES                 1
Name: count, Length: 361, dtype: int64

Questions of Interest

I want to answer the following questions in my analysis:

  1. What is the prevalence of the top 10 tree species within the city (which neighbourhoods can they be found in)?

  2. How has the distribution of these trees (i.e. how many are being planted each year, of each species) changed over time?

  3. Can we visualize the total tree counts per neighbourhood on a map?

In addition, I want to explore how tree properties (diameter and height) vary between the species and neighbourhoods.

Question 1. What is the prevalence of the top 10 tree species within the city (which neighbourhoods can they be found in)?

As seen earlier in this assignment (and below), the following are the top ten species: SERRULATA, PLATANOIDES, CERASIFERA, RUBRUM, AMERICANA, SYLVATICA, BETULUS, EUCHLORA X, FREEMANI X, and CAMPESTRE.

Let’s filter our dataframe to only look at these species.

top_trees_species_names.nlargest(10)
species_name
SERRULATA       463
PLATANOIDES     444
CERASIFERA      396
RUBRUM          261
AMERICANA       182
SYLVATICA       178
BETULUS         170
EUCHLORA   X    152
FREEMANI   X    127
CAMPESTRE       124
Name: count, dtype: int64

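Just out of curiosity, I looked up these trees online. Serrulata is the “Japanese cherry”, Platanoides the “Norway maple”, Cerasifera the “Cherry plum”, Rubrum the “Red maple”, Americana the “Linden tree”, Sylvatica the “Sour gum”, Betulus the “European hornbeam”, Euchlora the “Caucasian linden”, Freemani the “Freeman maple”, and Campestre the “Field maple.” These are all deciduous trees. Note that species_name holds only the species epithet, so a single name can cover more than one genus: the AMERICANA row in the head() output above, for example, has genus FRAXINUS and common name AUTUMN APPLAUSE ASH, which is an ash rather than a linden.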
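
Because species_name stores only the epithet, it can help to check how many genera sit behind each name and to build a combined label. This is a minimal sketch on made-up rows: the genus/species pairs are taken from the head() outputs in this notebook, except the TILIA row, which is a hypothetical example of a second genus sharing an epithet. The `binomial` column name is also an assumption for illustration.

```python
import pandas as pd

# Illustrative rows only. The TILIA row is HYPOTHETICAL; the others appear in the outputs above.
sample_df = pd.DataFrame({
    "genus_name": ["FRAXINUS", "TILIA", "ACER", "ACER", "PRUNUS"],
    "species_name": ["AMERICANA", "AMERICANA", "PLATANOIDES", "CAMPESTRE", "SERRULATA"],
})

# Number of distinct genera behind each species epithet.
genera_per_species = sample_df.groupby("species_name")["genus_name"].nunique()

# A combined genus + epithet label is unambiguous.
sample_df["binomial"] = (
    sample_df["genus_name"].str.title() + " " + sample_df["species_name"].str.lower()
)

print(genera_per_species)
print(sample_df["binomial"].tolist())
```

Any epithet with more than one genus in `genera_per_species` is a candidate for grouping on genus and species together rather than on species_name alone.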

top10_trees = ["SERRULATA", "PLATANOIDES", "CERASIFERA", "RUBRUM", "AMERICANA", "SYLVATICA", "BETULUS", "EUCHLORA   X", "FREEMANI   X", "CAMPESTRE"]
# Creating a new dataframe to populate with the top 10 species data
trees_df_top10 = pd.DataFrame(columns=trees_df.columns)

# Let's use a for-loop to filter our trees_df dataframe, and add the top 10 species to our new trees_df_top10 dataframe.
for tree in top10_trees:
    trees_toadd = trees_df[trees_df["species_name"].str.contains(tree)]
    trees_df_top10 = pd.concat([trees_df_top10, trees_toadd])

trees_df_top10 = trees_df_top10.reset_index()
trees_df_top10.head()
C:\Users\celle\AppData\Local\Temp\ipykernel_8360\32736908.py:7: FutureWarning: The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.
  trees_df_top10 = pd.concat([trees_df_top10, trees_toadd])
index Unnamed: 0 std_street on_street species_name neighbourhood_name date_planted diameter street_side_name genus_name ... plant_area curb tree_id common_name height_range_id on_street_block cultivar_name root_barrier latitude longitude
0 19 17945 W 12TH AV W 12TH AV SERRULATA Kitsilano 2008-03-13 9.0 ODD PRUNUS ... 20 Y 106587 SHIROTAE(MT FUJI) CHERRY 1 2600 SHIROTAE N 49.261319 -123.164948
1 21 28441 ST. CATHERINES ST E 49TH AV SERRULATA Sunset NaT 14.0 ODD PRUNUS ... 4 Y 44256 KWANZAN FLOWERING CHERRY 3 800 KWANZAN N 49.225494 -123.087200
2 42 24476 W 35TH AV W 35TH AV SERRULATA Shaughnessy NaT 11.0 EVEN PRUNUS ... 12 Y 33656 KWANZAN FLOWERING CHERRY 2 2000 KWANZAN N 49.239992 -123.152677
3 44 16997 VENABLES ST VERNON DRIVE SERRULATA Strathcona NaT 22.0 ODD PRUNUS ... 7 Y 115638 UKON JAPANESE CHERRY 3 800 UKON N 49.277064 -123.079379
4 60 1292 CAMOSUN ST CAMOSUN ST SERRULATA Dunbar-Southlands NaT 16.0 ODD PRUNUS ... N Y 204485 KWANZAN FLOWERING CHERRY 2 4400 KWANZAN N 49.246430 -123.196900

5 rows × 22 columns
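
As an alternative to the loop above, the same filter can be sketched in a single pass with `isin`. This avoids concatenating onto an empty frame (which is what the FutureWarning refers to) and matches names exactly, whereas `str.contains` does substring matching, so a shorter pattern such as SERRULA would also match SERRULATA. The frame and the `top_species` list below are made-up stand-ins for trees_df and top10_trees.

```python
import pandas as pd

# Small made-up frame standing in for trees_df.
sample_df = pd.DataFrame({
    "species_name": ["SERRULATA", "NIGRA", "RUBRUM", "SERRULA", "SERRULATA"],
    "diameter": [9.0, 12.0, 6.5, 3.0, 14.0],
})
top_species = ["SERRULATA", "RUBRUM"]  # stand-in for top10_trees

# Exact-match filter in one step; drop=True avoids adding an extra "index" column.
filtered = sample_df[sample_df["species_name"].isin(top_species)].reset_index(drop=True)

print(filtered)
print(filtered.dtypes)
```

One difference to be aware of: rows keep the order of the original frame instead of being grouped species by species, so the head() of the result would look different from the output shown above.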

# Let's plot a heat map to see which trees are present in each neighbourhood, and how many. 
# I've added a tooltip to help see how many trees exactly are denoted in the heat map. I've also added a select tool, to enable the selection of one of the neighbourhoods.

select_neighbourhood_click = alt.selection_point(encodings=["y"], on='click', nearest=True)
tree_plot = alt.Chart(trees_df_top10).mark_rect().encode(
    alt.X('species_name', title="Species name"),
    alt.Y('neighbourhood_name', title="Neighbourhood name"),
    color='count()',
    tooltip=[alt.Tooltip("count()", title="Number of trees")],
    opacity=alt.condition(select_neighbourhood_click, alt.value(0.9), alt.value(0.2))).properties(title="Count of trees within neighbourhoods")
tree_plot.add_params(select_neighbourhood_click)

In the EDA, I initially used a simple mark_rect plot to visualize this data, but I quickly realized that a heat map would be better, because it shows not only whether a species is present in a neighbourhood but also how many trees of that species are present.

Although the above plot demonstrates that there are certain neighbourhoods with greater tree counts than others, it also shows that almost all of the neighbourhoods have at least one exemplar of each of the top 10 tree species. It seems as though these trees are pretty well distributed throughout the city!
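
The claim about coverage can also be checked numerically rather than read off the heat map. The following is a minimal sketch on made-up rows standing in for trees_df_top10: a cross-tabulation gives the counts behind each heat map cell, and counting the non-zero cells per row gives the number of species present in each neighbourhood.

```python
import pandas as pd

# Made-up rows standing in for trees_df_top10.
sample_df = pd.DataFrame({
    "neighbourhood_name": ["Kitsilano", "Kitsilano", "Kitsilano", "Sunset", "Sunset", "Strathcona"],
    "species_name": ["SERRULATA", "RUBRUM", "SERRULATA", "SERRULATA", "RUBRUM", "RUBRUM"],
})

# Rows = neighbourhoods, columns = species, cells = tree counts (the numbers behind a heat map).
counts = pd.crosstab(sample_df["neighbourhood_name"], sample_df["species_name"])

# How many of the species have at least one tree in each neighbourhood.
species_present = (counts > 0).sum(axis=1)

print(counts)
print(species_present)
```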

Question 2. How has the distribution of these trees (i.e. how many are being planted each year, of each species) changed over time?

Has this been different over the different neighbourhoods?

# First, let's filter out the trees that do not have a "date_planted" value
trees_filtered_df = trees_df_top10[~pd.isnull(trees_df_top10["date_planted"])].reset_index()
trees_filtered_df.head()
level_0 index Unnamed: 0 std_street on_street species_name neighbourhood_name date_planted diameter street_side_name ... plant_area curb tree_id common_name height_range_id on_street_block cultivar_name root_barrier latitude longitude
0 0 19 17945 W 12TH AV W 12TH AV SERRULATA Kitsilano 2008-03-13 9.00 ODD ... 20 Y 106587 SHIROTAE(MT FUJI) CHERRY 1 2600 SHIROTAE N 49.261319 -123.164948
1 8 114 1978 SLOCAN ST SLOCAN ST SERRULATA Renfrew-Collingwood 2011-01-18 3.25 EVEN ... B Y 21236 KWANZAN FLOWERING CHERRY 1 3400 KWANZAN N 49.253228 -123.049443
2 16 253 10562 CHALDECOTT ST CHALDECOTT ST SERRULATA Dunbar-Southlands 2009-04-24 12.00 EVEN ... N Y 15443 KWANZAN FLOWERING CHERRY 2 4400 KWANZAN N 49.247000 -123.192180
3 18 263 15849 W 30TH AV W 30TH AV SERRULATA Arbutus-Ridge 1989-11-08 24.00 ODD ... 7 Y 123108 KWANZAN FLOWERING CHERRY 4 2700 KWANZAN N 49.245210 -123.167140
4 22 300 183 W 40TH AV W 40TH AV SERRULATA Shaughnessy 1996-05-31 13.50 ODD ... 10 Y 168916 KWANZAN FLOWERING CHERRY 2 1600 KWANZAN N 49.235750 -123.144273

5 rows × 23 columns

Our initial trees_df_top10 contained 2497 trees. Now we have only 1053 trees in our dataframe.
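
The before/after row counts can be reported directly in code. This sketch uses a made-up frame and `dropna(subset=...)`, which is equivalent to the boolean mask used above; `reset_index(drop=True)` is used here so that no extra index columns are carried along.

```python
import pandas as pd

# Made-up rows; two of the four have no planting date.
sample_df = pd.DataFrame({
    "species_name": ["SERRULATA", "RUBRUM", "SERRULATA", "CAMPESTRE"],
    "date_planted": pd.to_datetime(["2008-03-13", None, None, "1996-05-31"]),
})

rows_before = len(sample_df)
# Keep only rows with a planting date.
dated_df = sample_df.dropna(subset=["date_planted"]).reset_index(drop=True)
rows_after = len(dated_df)

print(f"{rows_before} rows before, {rows_after} after ({rows_after / rows_before:.0%} kept)")
```

For the notebook's own numbers, 1053 of 2497 rows corresponds to roughly 42% of the top 10 trees being kept.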

# Let's add a column to our trees_filtered_df to extract the year a tree was planted from the "date_planted" column.
trees_filtered_df = trees_filtered_df.assign(year_planted = trees_filtered_df.date_planted.dt.year)
trees_filtered_df.head()
level_0 index Unnamed: 0 std_street on_street species_name neighbourhood_name date_planted diameter street_side_name ... curb tree_id common_name height_range_id on_street_block cultivar_name root_barrier latitude longitude year_planted
0 0 19 17945 W 12TH AV W 12TH AV SERRULATA Kitsilano 2008-03-13 9.00 ODD ... Y 106587 SHIROTAE(MT FUJI) CHERRY 1 2600 SHIROTAE N 49.261319 -123.164948 2008
1 8 114 1978 SLOCAN ST SLOCAN ST SERRULATA Renfrew-Collingwood 2011-01-18 3.25 EVEN ... Y 21236 KWANZAN FLOWERING CHERRY 1 3400 KWANZAN N 49.253228 -123.049443 2011
2 16 253 10562 CHALDECOTT ST CHALDECOTT ST SERRULATA Dunbar-Southlands 2009-04-24 12.00 EVEN ... Y 15443 KWANZAN FLOWERING CHERRY 2 4400 KWANZAN N 49.247000 -123.192180 2009
3 18 263 15849 W 30TH AV W 30TH AV SERRULATA Arbutus-Ridge 1989-11-08 24.00 ODD ... Y 123108 KWANZAN FLOWERING CHERRY 4 2700 KWANZAN N 49.245210 -123.167140 1989
4 22 300 183 W 40TH AV W 40TH AV SERRULATA Shaughnessy 1996-05-31 13.50 ODD ... Y 168916 KWANZAN FLOWERING CHERRY 2 1600 KWANZAN N 49.235750 -123.144273 1996

5 rows × 24 columns

# Let's take our trees_filtered dataframe and group by species_name and year.
trees_by_species_and_year = trees_filtered_df.groupby(["species_name", trees_filtered_df.date_planted.dt.year]).size().reset_index().rename(columns = {0: "tree_count"})
trees_by_species_and_year
    species_name  date_planted  tree_count
0      AMERICANA          1992           1
1      AMERICANA          1993           5
2      AMERICANA          1994           6
3      AMERICANA          1995           1
4      AMERICANA          1996           3
..           ...           ...         ...
231    SYLVATICA          2014           7
232    SYLVATICA          2015           1
233    SYLVATICA          2017           1
234    SYLVATICA          2018           4
235    SYLVATICA          2019           2

236 rows × 3 columns

# When we check the dataframe info, we can see that the year_planted column created above has an integer dtype (int32), because dt.year returns plain integers rather than datetimes.
trees_filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   level_0             1053 non-null   int64         
 1   index               1053 non-null   int64         
 2   Unnamed: 0          1053 non-null   object        
 3   std_street          1053 non-null   object        
 4   on_street           1053 non-null   object        
 5   species_name        1053 non-null   object        
 6   neighbourhood_name  1053 non-null   object        
 7   date_planted        1053 non-null   datetime64[ns]
 8   diameter            1053 non-null   float64       
 9   street_side_name    1053 non-null   object        
 10  genus_name          1053 non-null   object        
 11  assigned            1053 non-null   object        
 12  civic_number        1053 non-null   object        
 13  plant_area          1044 non-null   object        
 14  curb                1053 non-null   object        
 15  tree_id             1053 non-null   object        
 16  common_name         1053 non-null   object        
 17  height_range_id     1053 non-null   object        
 18  on_street_block     1053 non-null   object        
 19  cultivar_name       904 non-null    object        
 20  root_barrier        1053 non-null   object        
 21  latitude            1053 non-null   float64       
 22  longitude           1053 non-null   float64       
 23  year_planted        1053 non-null   int32         
dtypes: datetime64[ns](1), float64(3), int32(1), int64(2), object(17)
memory usage: 193.5+ KB
# Let's change it back to datetime, so that we don't have trouble plotting.
trees_filtered_df['year_planted'] = pd.to_datetime(trees_filtered_df['year_planted'], format='%Y')
trees_filtered_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1053 entries, 0 to 1052
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   level_0             1053 non-null   int64         
 1   index               1053 non-null   int64         
 2   Unnamed: 0          1053 non-null   object        
 3   std_street          1053 non-null   object        
 4   on_street           1053 non-null   object        
 5   species_name        1053 non-null   object        
 6   neighbourhood_name  1053 non-null   object        
 7   date_planted        1053 non-null   datetime64[ns]
 8   diameter            1053 non-null   float64       
 9   street_side_name    1053 non-null   object        
 10  genus_name          1053 non-null   object        
 11  assigned            1053 non-null   object        
 12  civic_number        1053 non-null   object        
 13  plant_area          1044 non-null   object        
 14  curb                1053 non-null   object        
 15  tree_id             1053 non-null   object        
 16  common_name         1053 non-null   object        
 17  height_range_id     1053 non-null   object        
 18  on_street_block     1053 non-null   object        
 19  cultivar_name       904 non-null    object        
 20  root_barrier        1053 non-null   object        
 21  latitude            1053 non-null   float64       
 22  longitude           1053 non-null   float64       
 23  year_planted        1053 non-null   datetime64[ns]
dtypes: datetime64[ns](2), float64(3), int64(2), object(17)
memory usage: 197.6+ KB
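
To make the conversion above concrete, here is a small sketch showing what `pd.to_datetime(..., format='%Y')` does to integer years: each year becomes a timestamp for 1 January of that year. The values are illustrative.

```python
import pandas as pd

# Integer years, as produced by dt.year.
years = pd.Series([2008, 2011, 1989], name="year_planted")

# format="%Y" parses each value as a four-digit year, i.e. 1 January of that year.
as_dates = pd.to_datetime(years, format="%Y")

print(as_dates)
```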
# Now let's re-create our tree_plot on this trees_filtered_df, since we decreased the amount of our data by over half.
# Also, we want to use this plot later in a dashboard with other plots made using this reduced/filtered dataframe.
tree_plot_filtered = alt.Chart(trees_filtered_df).mark_rect().encode(
    alt.X('species_name', title="Species name"), 
    alt.Y('neighbourhood_name', title="Neighbourhood name"), 
    color=alt.Color('count()'), 
    tooltip=[alt.Tooltip("count()", title="Number of trees")], 
    opacity=alt.condition(select_neighbourhood_click, alt.value(0.9), alt.value(0.2))).properties(title="Count of trees within neighbourhoods")

# Add a title with instructions for how to use the interactivity.
tree_plot_title = alt.TitleParams("Count of trees within neighbourhoods",
     subtitle = "Click within the chart to select a neighbourhood to highlight.", 
     anchor = 'middle', 
     fontSize = 14,
     subtitleFontSize = 12)

tree_plot_filtered = tree_plot_filtered.add_params(select_neighbourhood_click)
tree_plot_filtered = tree_plot_filtered.properties(title=tree_plot_title)
tree_plot_filtered
# Let's use a stacked bar chart to see how the distribution of different species being planted each year has changed over time. 
# This type of chart enables one to see at the same time the TOTAL number of trees planted in a year, and (via coloured bars), how many trees of each species make up this total.
# I've added interactivity by enabling clicking on the legend to zoom in on a particular species (one or multiple).
legend_select = alt.selection_point(fields=['species_name'], bind='legend')
total_tree_bar_plot_int = alt.Chart(trees_filtered_df).mark_bar().encode(
    x=alt.X('year_planted', title="Year"), 
    y=alt.Y('count()', title = "Trees planted"), 
    color=alt.Color('species_name', scale=alt.Scale(domain=top10_trees), title="Species name"), 
    opacity=alt.condition(legend_select, alt.value(0.9), alt.value(0.2))).properties(title="Total trees planted per year") 

total_tree_bar_plot_int = total_tree_bar_plot_int.transform_filter(select_neighbourhood_click).transform_filter(legend_select).add_params(select_neighbourhood_click, legend_select)
total_tree_bar_plot_int
# We can also make a line chart of ALL of the trees planted per year.
trees_by_year_plot = alt.Chart(trees_filtered_df).mark_line().encode(alt.X('year_planted', title=None), alt.Y('count()', title = "Trees planted"))
trees_by_year_plot

It looks like a high number of trees (between 60 and 140) were planted between the years 1992 and 2013. Then the number of trees being planted dropped drastically. It would be interesting to see how this relates to the political party in power or the funding given to the parks board… but that is not something I am exploring in this analysis.
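
The yearly totals behind a line chart like this one can also be tabulated, which makes it easier to state exact ranges and to spot years with no plantings at all. A minimal sketch on made-up dates, assuming only a date_planted column:

```python
import pandas as pd

# Made-up planting dates; note that no tree is planted in 1994.
dates = pd.to_datetime(["1992-03-01", "1992-11-20", "1993-04-02", "1995-01-15"])
sample_df = pd.DataFrame({"date_planted": dates})

# Trees planted per calendar year, in chronological order.
per_year = sample_df["date_planted"].dt.year.value_counts().sort_index()

# Reindex over the full year range so years with no plantings appear as 0.
all_years = list(range(int(per_year.index.min()), int(per_year.index.max()) + 1))
per_year = per_year.reindex(all_years, fill_value=0)

print(per_year)
```

Without the reindex step, a year with zero plantings is simply absent from the table, and a line chart would interpolate straight across it.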

# Let's combine the two plots above. 
# As in the course notes, we can use a selection interval on the line chart to select the year range that we are interested in looking at on the bar graph that identifies the different species.
select_year = alt.selection_interval()
interval_chart = trees_by_year_plot.properties(height=50).add_params(select_year)
bar_chart = total_tree_bar_plot_int.encode(x=alt.X('year_planted', title=None, scale=alt.Scale(domain=select_year))).properties(title="", height=200)
year_chart = bar_chart & interval_chart

# Add a title with instructions for how to use the interactivity.
year_chart_title = alt.TitleParams("Total trees planted per year",
     subtitle = "Click on the species name (one or multiple) to select species. Use the lower chart to select the year range to zoom in on.", 
     anchor = 'middle', 
     fontSize = 14,
     subtitleFontSize = 12)

year_chart = year_chart.properties(title=year_chart_title)
year_chart

This visualization allows us to zoom in on the particular range of years that we are interested in, and then explore which species (one or several) were planted in those years.

Question 3. Can we visualize the total tree counts per neighbourhood on a map?

# Following the instructions provided in the course notes, I will create a map of Vancouver.
url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))
data_geojson_remote
Data({
  format: DataFormat({
    property: 'features',
    type: 'json'
  }),
  url: 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
})
# Here is the base Vancouver map.
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
    color = 'gray', opacity= 0.5, stroke='white').encode().project(type='identity', reflectY=True)

vancouver_map
# Now let's create another dataframe that we can use to plot points (in the correct location, based on latitude and longitude) of the total tree counts.
trees_by_hood = trees_filtered_df.groupby(by="neighbourhood_name").size().reset_index().rename(columns = {0: "tree_count"})
trees_by_hood
trees_by_hood_lat_lon = trees_filtered_df.groupby(by="neighbourhood_name").median(numeric_only=True).reset_index().drop(columns=["diameter"]) 
trees_by_hood_lat_lon

map_trees_df = pd.merge(trees_by_hood, trees_by_hood_lat_lon, left_on='neighbourhood_name', right_on="neighbourhood_name", how="inner")
map_trees_df
    neighbourhood_name        tree_count  level_0   index   latitude    longitude
 0  Arbutus-Ridge                     32   1514.5  2050.0  49.251766  -123.161059
 1  Downtown                          44   1861.0  2736.0  49.279161  -123.120819
 2  Dunbar-Southlands                 47   1320.0  2482.0  49.244350  -123.186220
 3  Fairview                          25   1500.0  1918.0  49.263053  -123.129507
 4  Grandview-Woodland                42   1550.5  2043.5  49.271694  -123.064417
 5  Hastings-Sunrise                  82   1525.5  2555.0  49.275150  -123.043930
 6  Kensington-Cedar Cottage          94   1613.0  2210.5  49.242945  -123.074047
 7  Kerrisdale                        47   1471.0  2295.0  49.229408  -123.154256
 8  Killarney                         44   1683.5  2376.5  49.220517  -123.035917
 9  Kitsilano                         38   1405.0  2731.5  49.262380  -123.153851
10  Marpole                           50   1777.5  2896.5  49.212110  -123.129391
11  Mount Pleasant                    35   1909.0  1998.0  49.262858  -123.099438
12  Oakridge                          45   1451.0  2799.0  49.227775  -123.123889
13  Renfrew-Collingwood               91   1510.0  2268.0  49.245406  -123.040583
14  Riley Park                        62   1701.0  2644.5  49.245382  -123.100527
15  Shaughnessy                       42    877.0  2667.5  49.243628  -123.139753
16  South Cambie                      29   1887.0  2427.0  49.246578  -123.119656
17  Strathcona                        12   1576.5  1852.5  49.282518  -123.091634
18  Sunset                            70   1577.0  2752.5  49.221410  -123.093632
19  Victoria-Fraserview               73   1760.0  2373.0  49.220275  -123.064658
20  West End                          25   1821.0  2168.0  49.285731  -123.131542
21  West Point Grey                   24    849.0  2126.0  49.264220  -123.208290
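
The two group-bys and the merge above could also be sketched as a single named aggregation, which produces only the columns the map needs (so leftover medians such as level_0 and index never appear). The frame below is made up; the column names follow the dataset, and `map_df` is a name chosen here for illustration.

```python
import pandas as pd

# Made-up coordinates standing in for trees_filtered_df.
sample_df = pd.DataFrame({
    "neighbourhood_name": ["Downtown", "Downtown", "Downtown", "Sunset", "Sunset"],
    "latitude": [49.279, 49.281, 49.283, 49.221, 49.223],
    "longitude": [-123.120, -123.118, -123.122, -123.094, -123.092],
})

# One pass: tree count plus median coordinates per neighbourhood.
map_df = (
    sample_df.groupby("neighbourhood_name")
    .agg(
        tree_count=("latitude", "size"),
        latitude=("latitude", "median"),
        longitude=("longitude", "median"),
    )
    .reset_index()
)

print(map_df)
```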
# We can use the above dataframe as the basis of our 'points' visualization. Let's make the points white, with a black stroke.
points = alt.Chart(map_trees_df).mark_circle(stroke="black").encode(
    longitude='longitude',
    latitude='latitude',
    size=alt.Size('tree_count:Q', title="Tree count"),
    color=alt.Color(value='white'),
    tooltip=[alt.Tooltip('neighbourhood_name:N', title='Neighbourhood'), alt.Tooltip('tree_count:Q', title='Total number of trees')]).project(type= 'identity', reflectY=True)

points
# To achieve the interactivity I would like in my final dashboard, I will create another layer to my map. The neighbourhoods, once clicked on in my heat map chart, will be highlighted in this map layer.
# I will make this layer green, to demonstrate that Vancouver is a "green" city of trees.
van_map = alt.Chart(data_geojson_remote).mark_geoshape().transform_lookup(
    lookup='properties.name',
    from_=alt.LookupData(map_trees_df, 'neighbourhood_name', ['tree_count', 'neighbourhood_name'])).encode(
    opacity = alt.condition(select_neighbourhood_click, alt.value(1), alt.value(0.2)),
    color = alt.Color(value="#005C29"),
    tooltip=[alt.Tooltip('neighbourhood_name:N', title='Neighbourhood'), alt.Tooltip('tree_count:Q', title='Total number of trees')]).project(type='identity', reflectY=True).transform_filter(select_neighbourhood_click).add_params(select_neighbourhood_click)

# Combining all of the maps together creates an object that I can use in my dashboard.
points_map = vancouver_map + van_map + points
points_map

Interactive Dashboard

# Now finally, let's combine all of our plots. 
# Let's make sure to transform the tree plot according to the legend_select, and add both the select_neighbourhood_click and legend_select selections.
tree_plot_filtered = tree_plot_filtered.transform_filter(legend_select).add_params(select_neighbourhood_click, legend_select)

# I will add a title to indicate that the data demonstrate that Vancouver has a large distribution of tree species within all neighbourhoods.
overall_title = alt.TitleParams(
    "Vancouver is a city of trees!",
     subtitle = "Top 10 tree species well represented within all neighbourhoods", 
     anchor = 'middle', 
     fontSize = 20,
     subtitleFontSize = 16)

(tree_plot_filtered | year_chart & points_map).properties(title=overall_title)

This dashboard lets a user explore three variables together: species name, neighbourhood name, and year planted. By clicking on the heat map and the bar and line plots, the number of trees of each species, per neighbourhood and year, can be visualized. The map at the bottom is not itself clickable; it displays where within Vancouver each neighbourhood can be found and highlights the neighbourhood selected in the heat map. The points on the map also summarize the total tree count (over all years and species) in each neighbourhood.

Bonus - some extra additions… widgets!

I made the conscious choice of not using widgets on my dashboard because I liked the elegant interactivity of clicking and selecting between the above plots. After a considerable amount of time playing around with different widget options, I decided that widgets don’t really add to the above visualization, and rather clutter and complicate it.

Nevertheless, to demonstrate my ability to add widgets to charts, I have added slider and dropdown widgets to a scatter plot below.

# Here I explored how tree height range and diameter are influenced by species and neighbourhood.
# I used a slider to choose the height_range_id to highlight in the scatter plot by changing its size. I added dropdowns to enable species and neighbourhood selection.

# On this chart, I adjusted the scale and size of the plot to zoom in on the data. There was one outlier point, with very large diameter, which I decided to "clip" to enable better visualization of the other points.

scatter_plot = alt.Chart(trees_filtered_df).mark_point(clip=True).encode(
    x=alt.X('height_range_id', scale=alt.Scale(domain=[0,10]), axis=alt.Axis(tickCount=9), title="Height range id (scale of 1 to 9)"), 
    y=alt.Y('diameter', scale=alt.Scale(domain=[0,40]), title = "Diameter (in)")).properties(title="Tree diameter vs. height range")

slider_height = alt.binding_range(name='Height range ', min=1, max=9, step=1)
select_height = alt.selection_point(
    fields=['height_range_id'],
    bind=slider_height,
    )

neighbourhoods = sorted(trees_filtered_df['neighbourhood_name'].unique())
dropdown_neighbourhoods = alt.binding_select(name='Neighbourhood ', options=neighbourhoods)
select_neighbourhood = alt.selection_point(fields=['neighbourhood_name'], bind=dropdown_neighbourhoods)

species = sorted(top10_trees)
dropdown_species = alt.binding_select(name='Species ', options=species)
select_species = alt.selection_point(fields=['species_name'], bind=dropdown_species)

scatter_plot = scatter_plot.add_params(select_neighbourhood, select_species, select_height).encode(
    opacity=alt.condition(select_neighbourhood, alt.value(1), alt.value(0.05)), 
    size = alt.condition(alt.datum.height_range_id < select_height.height_range_id, alt.value(100), alt.value(10)), 
    color=alt.condition(select_species, alt.value('purple'), alt.value('gray'))).properties(height=500, width=800) 
scatter_plot

The above interactive plot demonstrates that too many interactive options on one plot make it confusing without adding information. The points also partly obscure each other. As mentioned before, I felt that my interactive dashboard was already complete without widgets, so I made the conscious choice not to add them there.

Discussion and Concluding Remarks

This final assignment demonstrated the amazing interactivity possible with Altair.
Given that the dataset we were given to work with contained only a subset of all of the trees, it is difficult to say whether the conclusions below are correct, but the following are a few observations I made when interacting with the data:

  • A reminder to the reader: my dataframe was filtered to contain only trees that have a “date_planted” value, and only the top 10 species. The total dataframe of 5000 trees was therefore cut down to only 1053.

  1. The top 10 species are quite well represented across all neighbourhoods, with most neighbourhoods containing at least 7 of the 10 species.

Only Strathcona falls below this cut-off, with only 6 of the top 10 species represented. However, when we look at the total tree count within Strathcona (via the tooltip on the Vancouver map), we see that this neighbourhood also has only 12 trees in total (within this filtered dataset). Renfrew-Collingwood had 91 trees, and Kensington-Cedar Cottage had the largest number, 94. It would be interesting to create a map indicating average trees per unit area across the city. This would be a better way of comparing the neighbourhoods, as certain neighbourhoods are larger than others, so “total trees” is not directly comparable between a large neighbourhood and a small one. Regardless, when looking at species distribution, most species are very well represented throughout the city.
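
The trees-per-area comparison suggested above could be sketched as follows. The tree counts are taken from the map_trees_df output earlier in this notebook, but the area_km2 values are HYPOTHETICAL placeholders, not real neighbourhood areas; real areas would need to come from another source, such as the boundary GeoJSON.

```python
import pandas as pd

# tree_count values come from the map_trees_df output above;
# area_km2 values are HYPOTHETICAL placeholders, not real neighbourhood areas.
density_df = pd.DataFrame({
    "neighbourhood_name": ["Strathcona", "Renfrew-Collingwood", "West End"],
    "tree_count": [12, 91, 25],
    "area_km2": [4.0, 8.0, 2.0],
})

# Normalize the raw count by area so that large and small neighbourhoods are comparable.
density_df["trees_per_km2"] = density_df["tree_count"] / density_df["area_km2"]
ranked = density_df.sort_values("trees_per_km2", ascending=False).reset_index(drop=True)

print(ranked)
```

With these placeholder areas, the ranking by density differs from the ranking by raw count, which is exactly the effect a per-area map would be meant to reveal.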

  2. There was quite a good split of different species being planted each year, with almost all species having several trees planted across the city each year.

Initially, when I created my EDA, I looked at ALL of the different tree species within the dataset, and created an area chart to compare which trees were being planted each year. This was WAY too much information. I decided, in this assignment, to narrow down to the top 10 species. This is, however, still a lot of different data to look at. I think the interactive bar chart would be particularly helpful if a user was interested in comparing 2 or 3 of the different species, and their planting trends over a period of time.

When I started this analysis, I wanted to compare the prevalence of deciduous and evergreen trees. I quickly realized, however, that the majority of the top 25-30 most common species in the dataset are deciduous. This was quite interesting and surprising. It seems as though the City of Vancouver prefers planting deciduous trees, as opposed to the evergreen trees that are native to this area (cedar, Douglas fir, spruce, etc.). Perhaps these trees are already so common in the city that the choice is made not to plant them? It would be interesting to look into this further.

  3. 2016 was a terrible year for planting trees.

As mentioned previously, it would be interesting to see what happened politically in Vancouver in this year, or whether parks board funding was cut for some reason, or what happened to cause the terrible planting year in 2016.

I used a combination of plots to answer my questions, including:

  • heat map

  • bar chart

  • line chart

  • geographic map

  • scatter plot

References

  1. Vancouver trees dataset: https://opendata.vancouver.ca/explore/dataset/street-trees/information/?

  2. Data Visualization sample final project for inspiration and coding help

  3. Data Visualization course notes for coding examples and syntax